ENH/API: Add count parameter to limit generator in Series, DataFrame, and DataFrame.from_records() #5898
Conversation
Add a count parameter to Series, DataFrame, and DataFrame.from_records(). When reading data from a generator-type collection, only the first count values are read. Includes some refactoring in DataFrame.from_records(), and tests.
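A minimal sketch of the proposed usage, for illustration only; the `count` keyword shown here is what this PR proposes and is not part of released pandas:

```python
import pandas as pd

def squares():
    """An unbounded generator: it has no length, so pandas cannot
    preallocate storage for it without a hint."""
    i = 0
    while True:
        yield i * i
        i += 1

# Proposed in this PR (never merged into pandas):
# take only the first 10 values from the generator.
# s = pd.Series(squares(), count=10)
```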
Can you give an actual use case for this?
You can do random walks from an infinite random number generator. It's focused on solving #2305: to load data directly and efficiently, you need to know how much memory you will need, and generators and iterators are of indefinite length, so additional help is needed.
Not a big fan of adding a keyword to the constructors. Maybe a better way is to allow data to be a callable; then you can embed islice if you wanted to.
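For reference, a sketch of the islice idiom suggested above, applied to the random-walk use case; this works with released pandas because the iterator is bounded before pandas ever sees it:

```python
from itertools import islice

import numpy as np
import pandas as pd

def random_walk():
    """Infinite random walk: yields positions forever, so it can
    never be materialized whole."""
    pos = 0.0
    while True:
        pos += np.random.randn()
        yield pos

# Bound the infinite iterator explicitly, then hand pandas a list.
s = pd.Series(list(islice(random_walk(), 1000)))
```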
At the very least, it would be better to make this a classmethod.
Me third; wrong approach. Support reduced-memory data loading from an iterator of known length, viz. the recent discussion in #2193. We like the idea done as a new class method.
I planned to try this myself for 0.14, but if you beat me to it, so much the better. I'm closing this.
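For historical context: the class-method direction discussed above is essentially what pandas later shipped as the `nrows` keyword of `DataFrame.from_records`, which bounds how many rows are consumed when data is an iterator. A sketch, assuming a pandas version where `from_records` accepts a generator of tuples:

```python
import itertools

import pandas as pd

def records():
    """Unbounded stream of (x, x**2) rows."""
    for i in itertools.count():
        yield (i, i * i)

# nrows limits how many rows are read from the iterator,
# so the infinite generator is consumed only 5 times.
df = pd.DataFrame.from_records(records(), columns=["x", "x2"], nrows=5)
print(df)
```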
p.s. Side note: why is it useful to avoid creating an intermediate list or ndarray? I assure you that even if you pass in a generator, it will eventually be converted into a list or ndarray, then have its types massaged, etc. The sugar of pandas does sacrifice some memory efficiency in loading and manipulating data. It's a tradeoff, and sometimes you might need to stick with numpy for really performance-critical parts.
@y-p go for it. Currently I'm only a hobby programmer and don't have much time available to work on this, but I'm happy to help with anything in my little spare time. I have been thinking about this problem of memory consumption for months, and I came to the same task list as you; in fact this PR matches task number 1 and was written with it in mind. I made this during the holidays, and because I didn't have enough time to write all the code, I filed GH5902 to try to express my ideas on the rest of the process. Unluckily I'm not a native English speaker and I don't express myself as well as I would like. This PR is focused on the API change: the addition of a 'count' attribute. Maybe I aimed for too broad an inclusion of the attribute for the sake of consistency, but either way you are going to need a 'count' attribute, because generators/iterators are not sized objects: they have no len().

On measuring memory consumption: some time ago I tried IPython notebooks and %memit, but I prefer to use common sense.
If the generator yields more results than count, the extra values are simply ignored. If the generator is exhausted before reaching count, the ndarray is resized down to the correct size. Since this only ever downsizes the ndarray, no copy/move operations happen in memory, so there is no performance penalty.
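A minimal NumPy sketch of the allocation strategy described above; the helper name and float dtype are illustrative, not the PR's actual code:

```python
from itertools import islice

import numpy as np

def fill_from_iterator(it, count):
    """Preallocate `count` float slots, fill them from `it`, and
    shrink the array in place if the iterator runs out early."""
    arr = np.empty(count, dtype=np.float64)
    n = 0
    for value in islice(it, count):  # values beyond `count` are ignored
        arr[n] = value
        n += 1
    if n < count:
        # A pure downsize: existing elements do not need to be copied.
        arr.resize(n, refcheck=False)
    return arr

print(fill_from_iterator(iter(range(3)), 10))  # -> [0. 1. 2.]
```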
When reading data from a generator-type collection, only the first count values are read.
This is a first step toward solving #2305: knowing the length of the data in order to allocate memory.